In this project we investigate the Gapminder CO2 emissions data alongside other variables: life expectancy, population, GDP, and income. All of these are known drivers of CO2 emissions. We then compare them with data on airline carrier departures and passenger flights to help us understand CO2 emissions.
In late 2020 Airbus announced three new hydrogen-powered "zero-emission" concept aircraft, each representing a different approach to exploring technology pathways towards the climate-neutral targets set by the company and in accordance with European Green Deal standards. The company stated that the time is right for new development in areas such as hydrogen to take hold, so that companies such as Airbus, the wider industry, and governments can work together to meet innovation requirements and move towards an entirely new way to fly by 2030.
https://ec.europa.eu/info/strategy/priorities-2019-2024/european-green-deal_en
We hope to gain insights into known drivers of CO2 emissions across countries and regions, including identifying potential connections between these drivers and flight data, and to see how total passengers and total departures relate to them.
Naturally, we wanted to check whether there is any potential relationship between flights and CO2 emissions, but before analyzing flights we pulled in the known variables: life expectancy, GDP, population, and income.
Questions we would like to answer in this exploration are the following:
# In this cell are the import statements for all of the packages used
import plotly
import numpy as np
import pandas as pd
import itertools
import os
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# %matplotlib inline
pd.options.display.float_format = "{:,.2f}".format
In this section of the report, the data is loaded, checked for cleanliness, and then trimmed and cleaned for analysis.
The data includes a list of continents, which groups countries by continent for further analysis.
path = '/Users/coryrobbins/projects/AIRBUS_Data-Analyst-Nanodegree_Udacity/M1_intro_data_analysis/project1/gapminder_data/'
continents = pd.read_csv(path + 'Countries-Continents.csv')
continents.set_index("Country", inplace=True)
co2 = pd.read_csv(path + 'co2_emissions_tonnes_per_person.csv', index_col='country')
income = pd.read_csv(path + 'income_per_person_gdppercapita_ppp_inflation_adjusted.csv', index_col='country')
life = pd.read_csv(path + 'life_expectancy_years.csv', index_col='country')
population = pd.read_csv(path + 'population_total.csv', index_col='country')
gdp = pd.read_csv(path + 'total_gdp_ppp_inflation_adjusted.csv', index_col='country')
consumption = pd.read_csv(path + 'consumption_co2_emissions_1000_tonnes.csv', index_col='country')
dprt = pd.read_csv(path + 'is_air_dprt.csv', index_col='country')
psgr = pd.read_csv(path + 'is_air_psgr.csv', index_col='country')
all_data = {
'co2' : co2,
'income' : income,
'life' : life,
'population' : population,
'gdp' : gdp,
# Loaded above but not included in all_data:
# 'consumption' : consumption,
# 'dprt' : dprt,
# 'psgr' : psgr,
}
all_data = {name:value.transpose() for name, value in all_data.items()}
Now we want a flattened (tall and skinny) representation of the data, which is closer to how other online resources present it.
For this we need the stack / unstack operations.
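As a minimal sketch of what stack and unstack do, the toy table below mimics the Gapminder layout (one row per country, one column per year). The values and the two-country frame are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy wide table in the Gapminder layout: one row per country, one column per year.
# The numbers are illustrative only.
wide = pd.DataFrame(
    {"1800": [28.2, 35.4], "1801": [28.2, 35.4]},
    index=pd.Index(["Afghanistan", "Albania"], name="country"),
)
wide.columns.name = "year"

# stack() moves the year columns into the row index,
# giving one row per (country, year) pair.
tall = wide.stack().rename("life").reset_index()

# unstack() is the inverse: it pivots the innermost index level back into columns.
round_trip = tall.set_index(["country", "year"])["life"].unstack()

print(tall)
print(round_trip)
```

The tall form is convenient for filtering by year and for plotting libraries that expect one observation per row, while the wide form is closer to how the CSV files are stored.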
all_data.keys()
all_data['continent'] = continents #['Continent']
# means['continent'] = continents['Continent']
name = "life"
datanow = all_data[name].transpose()
datanow["continent"] = all_data["continent"]
datanow = datanow.reset_index().set_index(['country', 'continent'])
# datanow = datanow.transpose()
datanow = (
datanow
.stack()
.reset_index()
.rename(columns={"level_2":"year", 0:name})
.set_index( ["continent", "country", "year"]) #.transpose()
)
datanow.astype(int)
| continent | country | year | life |
|---|---|---|---|
| Asia | Afghanistan | 1800 | 28 |
| | | 1801 | 28 |
| | | 1802 | 28 |
| | | 1803 | 28 |
| | | 1804 | 28 |
| ... | ... | ... | ... |
| Africa | Zimbabwe | 2096 | 75 |
| | | 2097 | 75 |
| | | 2098 | 75 |
| | | 2099 | 75 |
| | | 2100 | 75 |
55528 rows × 1 columns
combined = pd.DataFrame()
for name in all_data:
    if name == "continent":
        continue
    datanow = all_data[name].transpose()
    datanow["continent"] = all_data["continent"]
    datanow = datanow.reset_index().set_index(['country', 'continent'])
    # datanow = datanow.transpose()
    datanow = (
        datanow
        .stack()
        .reset_index()
        .rename(columns={"level_2":"year", 0:name})
        .set_index( ["continent", "country", "year"]) #.transpose()
    )
    combined[name] = datanow[name]
combined.head()
combined = combined.reset_index()#.fillna(0)
combined.sample(n=20)
| | continent | country | year | co2 | income | life | population | gdp |
|---|---|---|---|---|---|---|---|---|
| 16335 | Asia | Turkey | 2018 | 5.20 | 25,300.00 | 79.20 | 82300000 | nan |
| 1964 | North America | Belize | 1978 | 1.56 | 3,930.00 | 69.20 | 139000 | 519,000,000.00 |
| 13395 | Oceania | Samoa | 2000 | 0.82 | 4,330.00 | 71.70 | 174000 | 696,000,000.00 |
| 6549 | North America | Grenada | 1984 | 0.63 | 4,640.00 | 68.60 | 98500 | 525,000,000.00 |
| 3844 | NaN | Congo, Rep. | 1982 | 0.71 | 6,100.00 | 53.80 | 1880000 | 11,100,000,000.00 |
| 793 | Oceania | Australia | 1920 | 4.88 | 8,660.00 | 60.60 | 5270000 | 40,800,000,000.00 |
| 6047 | Europe | Georgia | 1977 | 5.00 | 7,690.00 | 68.20 | 4930000 | 34,300,000,000.00 |
| 6771 | Africa | Guinea-Bissau | 1998 | 0.15 | 1,380.00 | 50.20 | 1150000 | 1,390,000,000.00 |
| 13421 | Africa | Sao Tome and Principe | 1958 | 0.12 | 1,500.00 | 51.20 | 62300 | 76,900,000.00 |
| 5857 | Africa | Gambia | 1954 | 0.05 | 1,250.00 | 41.40 | 324000 | 310,000,000.00 |
| 13252 | NaN | Russia | 1986 | 16.40 | 21,000.00 | 70.00 | 144000000 | 2,750,000,000,000.00 |
| 10591 | Africa | Morocco | 1941 | 0.05 | 2,000.00 | 35.30 | 7800000 | 16,700,000,000.00 |
| 2889 | Africa | Cameroon | 1994 | 0.21 | 2,340.00 | 56.50 | 13200000 | 26,600,000,000.00 |
| 7059 | Europe | Hungary | 1890 | 0.80 | 3,580.00 | 36.00 | 6610000 | 24,100,000,000.00 |
| 6949 | North America | Honduras | 1969 | 0.45 | 2,760.00 | 58.00 | 2640000 | 8,030,000,000.00 |
| 6479 | Europe | Greece | 1983 | 5.56 | 18,000.00 | 77.60 | 9860000 | 170,000,000,000.00 |
| 2073 | Asia | Bhutan | 1977 | 0.02 | 1,110.00 | 52.00 | 371000 | 395,000,000.00 |
| 10370 | Europe | Moldova | 1940 | 1.44 | 2,070.00 | 46.60 | 2120000 | 4,610,000,000.00 |
| 14876 | NaN | St. Vincent and the Grenadines | 1963 | 0.17 | 2,830.00 | 61.10 | 84200 | 271,000,000.00 |
| 16106 | Africa | Tunisia | 1943 | 0.02 | 1,780.00 | 27.50 | 3240000 | 5,600,000,000.00 |
list(combined)
['continent', 'country', 'year', 'co2', 'income', 'life', 'population', 'gdp']
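# numeric_only=True skips the text columns (country, year), which recent pandas versions no longer drop silently
combined.groupby('continent').mean(numeric_only=True)
| continent | co2 | income | life | population | gdp |
|---|---|---|---|---|---|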
| Africa | 0.89 | 3,698.60 | 53.07 | 11,801,417.93 | 39,238,976,864.49 |
| Asia | 5.16 | 13,256.68 | 54.31 | 71,474,855.70 | 281,981,128,982.90 |
| Europe | 4.17 | 11,170.86 | 57.36 | 13,115,810.07 | 151,506,128,864.06 |
| North America | 3.46 | 8,200.75 | 63.15 | 7,722,918.67 | 86,712,467,902.05 |
| Oceania | 3.40 | 9,048.98 | 61.71 | 2,416,283.92 | 43,774,635,180.72 |
| South America | 1.70 | 7,102.58 | 58.17 | 17,946,754.58 | 143,677,907,180.39 |
combined.dtypes
continent      object
country        object
year           object
co2           float64
income        float64
life          float64
population      int64
gdp           float64
dtype: object
combined['year'].dtype
dtype('O')
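Because `year` is stored as an object (string) column, comparisons such as `year == '2010'` are string comparisons and range filters do not behave numerically. A minimal sketch of converting it to integers is shown below on a made-up frame; the column names mirror `combined`, but the rows are assumptions for illustration.

```python
import pandas as pd

# Toy frame: 'year' arrives as strings because it came from CSV column labels.
df = pd.DataFrame(
    {
        "country": ["A", "A", "B"],
        "year": ["1800", "1801", "1800"],
        "co2": [0.1, 0.2, 0.3],
    }
)

# Convert the year labels to integers so that numeric comparisons work.
df["year"] = pd.to_numeric(df["year"]).astype("int64")

# Numeric range filter, which would not be reliable on string years.
recent = df[df["year"] >= 1801]
print(df.dtypes)
print(recent)
```

If this conversion were applied to `combined`, the later `query("year == '2010'")` calls would need the quotes around the year removed.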
We use .isna() and .sum(), compared against the shape, to count the missing values in each column.
combined.isna().sum()
continent     1831
country          0
year             0
co2              0
income          29
life           200
population       0
gdp            978
dtype: int64
combined.shape
(18039, 8)
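Comparing the counts with the row total gives the share of missing values per column; in the output above, 1,831 of 18,039 rows (about 10%) have no continent. A minimal sketch of that calculation follows, using a small made-up frame rather than `combined`.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (illustrative values only).
toy = pd.DataFrame(
    {
        "continent": ["Asia", None, "Europe", None],
        "gdp": [1.0, np.nan, 3.0, 4.0],
        "co2": [0.5, 0.6, 0.7, 0.8],
    }
)

# isna() gives a boolean frame; sum() counts the True values per column,
# and mean() turns the same booleans into a fraction of rows.
summary = pd.DataFrame(
    {"missing": toy.isna().sum(), "share": toy.isna().mean()}
)
print(summary)
```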
# Import the missingno package to visualize missing values
import missingno as msno
missingdata_df = combined.columns[combined.isnull().any()].tolist()
msno.matrix(combined[missingdata_df])
<AxesSubplot:>
# Fill missing values (NaN) with zero; no rows are dropped
combined_filled = combined.fillna(0)
filled_zeros = combined.fillna(0)
filled_zeros.sort_values(["year", "co2",], ascending=False)
| | continent | country | year | co2 | income | life | population | gdp |
|---|---|---|---|---|---|---|---|---|
| 12955 | Asia | Qatar | 2018 | 38.00 | 113,000.00 | 80.30 | 2780000 | 0.00 |
| 16086 | North America | Trinidad and Tobago | 2018 | 31.30 | 28,600.00 | 74.40 | 1390000 | 0.00 |
| 8770 | Asia | Kuwait | 2018 | 23.70 | 65,500.00 | 83.20 | 4140000 | 0.00 |
| 16827 | Asia | United Arab Emirates | 2018 | 21.40 | 66,600.00 | 73.50 | 9630000 | 0.00 |
| 1414 | Asia | Bahrain | 2018 | 19.80 | 42,000.00 | 79.60 | 1570000 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16828 | Europe | United Kingdom | 1800 | 2.48 | 3,280.00 | 38.60 | 10800000 | 35,700,000,000.00 |
| 12518 | Europe | Poland | 1800 | 0.05 | 1,100.00 | 35.90 | 9000000 | 14,000,000,000.00 |
| 6089 | Europe | Germany | 1800 | 0.04 | 1,990.00 | 38.40 | 18000000 | 46,900,000,000.00 |
| 17047 | 0 | United States | 1800 | 0.04 | 1,980.00 | 39.40 | 6000000 | 14,300,000,000.00 |
| 2914 | North America | Canada | 1800 | 0.01 | 1,310.00 | 39.00 | 500000 | 846,000,000.00 |
18039 rows × 8 columns
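Filling with zero has side effects that are visible in the table above: GDP shows as 0.00 for 2018 and the continent of the United States becomes 0. Zero-filled values also pull averages down. The sketch below illustrates the effect on a made-up column and shows dropping incomplete rows as one possible alternative. It is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy GDP column with one missing value (illustrative numbers only).
toy = pd.DataFrame({"country": ["A", "B", "C"], "gdp": [100.0, np.nan, 300.0]})

# pandas skips NaN by default, so the mean uses the two known values.
mean_skipping_nan = toy["gdp"].mean()

# After filling with zero, the missing value counts as a real observation of 0.
mean_after_zero_fill = toy["gdp"].fillna(0).mean()

# Alternative: keep only the rows where the column of interest is present.
complete_rows = toy.dropna(subset=["gdp"])

print(mean_skipping_nan, mean_after_zero_fill, len(complete_rows))
```

Which option is appropriate depends on the question: zero-filling keeps every row for sorting and display, while skipping or dropping missing values keeps summary statistics closer to the observed data.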
combined["country"].describe()
count             18039
unique              192
top       United States
freq                219
Name: country, dtype: object
In this section we explore the data in different ways, computing statistics and creating visualizations with the goal of addressing the research questions posed in the Introduction section.
We take a systematic approach: we look at one variable at a time, and then follow up by looking at relationships between variables.
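Fastest-growing producers of carbon dioxide per capita
# Note: co2 has one row per country and one column per year, so max() and min()
# reduce over countries and this yields one value per year, not one per country.
top_25_growers = (co2.max() - co2.min()) / co2.min()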
top_25_growers
1800 337.34
1801 342.58
1802 350.00
1803 398.41
1804 405.02
...
2014 1,124.33
2015 1,124.34
2016 1,514.75
2017 1,630.15
2018 1,562.79
Length: 219, dtype: float64
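Because `co2` holds countries in rows and years in columns, the default `max()` and `min()` reduce over countries, which is why the result above is indexed by year. A per-country growth figure would need the reduction to run across the year columns instead. The following is a minimal sketch under that assumption, on a made-up three-country table; `co2_toy`, `growth`, and `top_growers` are hypothetical names.

```python
import pandas as pd

# Toy table in the same layout as co2: countries in rows, years in columns.
# Values are illustrative only.
co2_toy = pd.DataFrame(
    {"1800": [0.5, 0.25, 2.0], "1900": [1.0, 0.25, 3.0], "2000": [2.0, 1.5, 2.5]},
    index=pd.Index(["A", "B", "C"], name="country"),
)

# axis=1 reduces across the year columns, giving one value per country.
row_max = co2_toy.max(axis=1)
row_min = co2_toy.min(axis=1)

# Relative spread between each country's lowest and highest value.
# A country whose minimum is 0 would produce inf here.
growth = (row_max - row_min) / row_min

# Largest relative increases first.
top_growers = growth.sort_values(ascending=False).head(2)
print(top_growers)
```

Note that this measures the gap between each country's minimum and maximum, regardless of which came first in time, so it is only a rough proxy for growth.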
top_25_columns = {}
for name, df in all_data.items():
    top_25 = df.mean().dropna().sort_values().iloc[-25:]
    top_25_columns[name] = top_25.index
top_25_columns
{'co2': Index(['Liechtenstein', 'Libya', 'Slovenia', 'Andorra', 'Canada', 'Australia',
'Iceland', 'Czech Republic', 'United Kingdom', 'Belgium', 'Oman',
'Singapore', 'Saudi Arabia', 'United States', 'Nauru', 'Bahamas',
'Estonia', 'Palau', 'Trinidad and Tobago', 'Bahrain', 'Brunei',
'Kuwait', 'Luxembourg', 'United Arab Emirates', 'Qatar'],
dtype='object', name='country'),
'income': Index(['France', 'United Kingdom', 'Austria', 'Belgium', 'Sweden', 'Canada',
'Germany', 'Bahrain', 'Australia', 'Andorra', 'Denmark', 'Netherlands',
'Ireland', 'Saudi Arabia', 'United States', 'San Marino', 'Monaco',
'Singapore', 'Norway', 'Switzerland', 'Kuwait', 'Luxembourg',
'United Arab Emirates', 'Brunei', 'Qatar'],
dtype='object', name='country'),
'life': Index(['Italy', 'Austria', 'Greece', 'Japan', 'Cyprus', 'Finland', 'Germany',
'Luxembourg', 'Marshall Islands', 'New Zealand', 'United States',
'France', 'Ireland', 'Australia', 'Belgium', 'Iceland',
'United Kingdom', 'Netherlands', 'Canada', 'Switzerland', 'Denmark',
'Sweden', 'Norway', 'Dominica', 'Andorra'],
dtype='object', name='country'),
'population': Index(['Uganda', 'Kenya', 'Italy', 'Iran', 'Turkey', 'Vietnam', 'France',
'United Kingdom', 'Tanzania', 'Philippines', 'Germany', 'Egypt',
'Mexico', 'Japan', 'Congo, Dem. Rep.', 'Ethiopia', 'Bangladesh',
'Russia', 'Brazil', 'Pakistan', 'Indonesia', 'Nigeria', 'United States',
'India', 'China'],
dtype='object', name='country'),
'gdp': Index(['Argentina', 'Nigeria', 'Egypt', 'Ukraine', 'Netherlands', 'Australia',
'South Korea', 'Iran', 'Poland', 'Saudi Arabia', 'Turkey', 'Canada',
'Spain', 'Indonesia', 'Mexico', 'Brazil', 'Italy', 'United Kingdom',
'France', 'India', 'Russia', 'Germany', 'Japan', 'China',
'United States'],
dtype='object', name='country'),
'continent': Index([], dtype='object')}
To quickly compare the known indicators, we drew box plots of the top 25 countries by mean value for each indicator, as a starting point for understanding the data.
plt.figure(figsize=(18, 50), tight_layout=True)
for num, name_df in enumerate(all_data.items(), start=1):
    name, df = name_df
    plt.subplot(10,2,num)
    plt.tight_layout()
    df[top_25_columns[name]].boxplot(vert=False)
    plt.title(name)
To plot the variables in a correlation coefficient heat map, so that we can see which variables are highly correlated with one another, we first need to reduce the data to per-country means and put them into their own statistics dataframe.
# Want to find the averages of all the indicators and plot them and then generate a stats dataframe
means = {name:df.mean() for name, df in all_data.items()}
means['continent'] = continents['Continent']
means = pd.DataFrame(means)
means.sample(n=5)
| | co2 | income | life | population | gdp | continent |
|---|---|---|---|---|---|---|
| Tuvalu | 0.91 | 1,697.60 | nan | 8,001.33 | 9,836,359.22 | Oceania |
| North Korea | 3.57 | 839.33 | 50.94 | 13,834,352.16 | 11,986,901,408.45 | NaN |
| Spain | 2.03 | 11,116.80 | 58.78 | 28,366,777.41 | 241,755,607,476.64 | Europe |
| Colombia | 1.09 | 4,509.61 | 56.14 | 23,149,734.22 | 67,426,485,981.31 | South America |
| Cameroon | 0.21 | 1,676.29 | 46.70 | 20,007,508.31 | 8,827,943,925.23 | Africa |
Run correlation matrix to see how closely related the indicators are
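# numeric_only=True skips the non-numeric 'continent' column (needed on recent pandas versions)
corr = means.corr(numeric_only=True)
# Styler.format(precision=2) replaces the deprecated Styler.set_precision(2)
corr.style.background_gradient(cmap='coolwarm').format(precision=2)
| | co2 | income | life | population | gdp |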
|---|---|---|---|---|---|
| co2 | 1.00 | 0.83 | 0.32 | -0.07 | 0.08 |
| income | 0.83 | 1.00 | 0.62 | -0.06 | 0.21 |
| life | 0.32 | 0.62 | 1.00 | -0.08 | 0.21 |
| population | -0.07 | -0.06 | -0.08 | 1.00 | 0.55 |
| gdp | 0.08 | 0.21 | 0.21 | 0.55 | 1.00 |
We use Python's itertools module to loop over all pairwise combinations of indicators, paying particular attention to those with a positive coefficient, and plot each pair with a fitted linear regression line to show which variables indicate a possible relationship.
Without making definitive assumptions about the data, we only want to visualize the possible relationships so that we can form our own impression of how these indicators may affect each other. Plotting the relationships also reveals the skewness in the points and prompts us to think about which outliers may exist.
means.continent
Afghanistan Asia
Albania Europe
Algeria Africa
Andorra Europe
Angola Africa
...
Venezuela South America
Vietnam Asia
Yemen Asia
Zambia Africa
Zimbabwe Africa
Name: continent, Length: 216, dtype: object
for name1, name2 in itertools.combinations(all_data, 2):
    #plt.figure()
    if "continent" in (name1, name2):
        continue
    print(name1, name2)
    sns.lmplot(x=name1, y=name2, data=means, scatter_kws={'alpha':0.3})
    #plt.scatter(means[name1], means[name2])
    #plt.xlabel(name1)
    #plt.ylabel(name2)
    plt.title(f"{name1} vs {name2}")
    plt.show()
co2 income
co2 life
co2 population
co2 gdp
income life
income population
income gdp
life population
life gdp
population gdp
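The loop above plots every pair of indicators. If only the pairs with a positive correlation coefficient are of interest, they could be selected from the correlation matrix first. The following is a minimal sketch of that idea on made-up data; `toy_means` and `positive_pairs` are hypothetical names, and the values are not from the Gapminder files.

```python
import itertools

import pandas as pd

# Toy per-country means (illustrative values only).
toy_means = pd.DataFrame(
    {
        "co2": [1.0, 2.0, 3.0, 4.0],
        "income": [10.0, 20.0, 25.0, 40.0],
        "population": [9.0, 7.0, 4.0, 1.0],
    }
)

# Pairwise Pearson correlation between the numeric columns.
corr = toy_means.corr()

# Keep only the unordered column pairs whose coefficient is positive.
positive_pairs = [
    (a, b)
    for a, b in itertools.combinations(toy_means.columns, 2)
    if corr.loc[a, b] > 0
]
print(positive_pairs)
```

The resulting list could then drive the plotting loop in place of iterating over all combinations.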
A further analysis could be made by continent and economic region.
# Once data has a column that you can group on (categorical data)
category_column = 'continent'
#unique_group_names = means[category_column].unique() #['asia', 'europe', 'north america', .e..] # dataframe[category_column].unique()
means.groupby(category_column).mean() # For each group (e.g. continent), generate the group mean
| continent | co2 | income | life | population | gdp |
|---|---|---|---|---|---|
| Africa | 0.81 | 2,254.09 | 48.32 | 17,583,126.60 | 13,020,080,268.93 |
| Asia | 5.97 | 8,693.91 | 51.96 | 64,927,276.17 | 130,028,426,139.28 |
| Europe | 4.63 | 11,442.06 | 60.21 | 9,733,677.60 | 89,904,554,434.61 |
| North America | 2.87 | 5,396.84 | 55.62 | 6,875,999.56 | 33,205,514,318.21 |
| Oceania | 3.12 | 4,837.50 | 52.13 | 1,860,593.01 | 13,152,763,290.91 |
| South America | 1.72 | 5,339.23 | 55.11 | 17,229,357.92 | 65,002,397,468.85 |
We will first plot the means, using the plotly module to produce various figures that represent the data.
import plotly.express as px
plot_means = means.reset_index().rename(columns={'index':'country'}).dropna()
fig = px.scatter(
plot_means,
x="population",
y="gdp",
size="co2", #np.sqrt(plot_means["gdp"]),
size_max = 40,
color="continent",
hover_name="country",
log_y=True,
log_x=True,
title="Mean GDP v. mean population (bubble size = mean CO2 emissions per capita)")
fig.show()
fig1 = px.scatter(combined.query("year == '2010'").query("continent in ('North America', 'South America', 'Asia', 'Europe')").dropna(), x="gdp", y="co2",
    size="population",
    color="co2",
    hover_name="country", log_x=True, size_max=50,
    title="CO2 per capita v. GDP in 2010 (bubble size = population)")
fig1.show()
fig3 = px.scatter(combined.query("year in ('2003')").dropna(), x="gdp", y="co2",
size="co2",
color="continent",
hover_name="country", log_x=True, size_max=60,
title="CO2 per capita v. GDP in 2003 (bubble size = CO2 per capita)")
fig3.show()
In this project we investigated CO2 emissions data provided by Gapminder and explored data sets of other known indicators such as life expectancy, population, GDP, and income. We compared those to flight data, also provided by Gapminder, to help us understand how known indicators of CO2 emissions may also influence our flying behavior, and through it the aerospace industry's contribution to overall global CO2 emissions.
To conclude the Gapminder data set exploration, and without implying causation from correlation, the data suggest that known drivers such as population, life expectancy, income, and GDP are associated with the amount of CO2 (per capita) a country emits.
- The data suggest a connection between the known variables (population, life expectancy, income, and GDP) and the amount of CO2 emitted per capita per country. In the correlation matrix of country means the link is strongest for income (0.83) and weak for population (-0.07) and GDP (0.08).
- The data also suggest that many other factors beyond the known drivers used in this analysis, such as types of industry, energy production, or deforestation (logging / forestry industry), would likely show similar relationships to the ones selected in this project.
- All variables, including passenger flights and carrier departures, trend in the same direction as other drivers such as income and GDP. The data could therefore suggest that behavior is also a driving force, as higher income, life expectancy, etc. tend to coincide with a higher number of passenger flights in general, as shown in Figure 3.
- Limitations exist due to the categorical nature of the dates when analyzing the data. Therefore, no statistical methodology beyond basic correlations can be used to showcase the nature of the potential relationships between the variables.
- Another clear limitation is that countries develop at different rates and show cyclical patterns, which causes inconsistency when developing countries are merged into regions that may not be equal in economic impact or regional development.
- The statistics used here are descriptive rather than inferential. Inferential statistics would require a more scientific approach using controlled experiments with a hypothesis, rather than the exploratory inferences we make with our data.
plotly.offline.init_notebook_mode()